library(ggplot2)
library(dplyr)
library(tidyverse)
library(plotly)

1 Details about the Project

1.1 Information about the dataset

  • In 2013, students of the Statistics class at FSEV UK were asked to invite their friends to participate in this survey.
  • The data file (responses.csv) consists of 1010 rows and 150 columns (139 integer and 11 categorical).
  • For convenience, the original variable names were shortened in the data file.
  • File columns.csv represents the dataset for the shortenings made for each variable name
  • All participants were of Slovakian nationality (the questionnaire was later translated in English), aged between 15-30.

The variables can be split into the following fields:

  • Music preferences (19 items)
  • Movie preferences (12 items)
  • Hobbies & interests (32 items)
  • Phobias (10 items)
  • Health habits (3 items)
  • Personality traits, views on life, & opinions (57 items)
  • Spending habits (7 items)
  • Demographics (10 items)

1.2 Main Question

How to effectively visualize a lot of variables in order to gain some meaningful insights from the data?

1.3 Example from a specific field

MUSIC PREFERENCES

  1. I enjoy listening to music.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)
  2. I prefer.: Slow paced music 1-2-3-4-5 Fast paced music (integer)
  3. Dance, Disco, Funk: Don’t enjoy at all 1-2-3-4-5 Enjoy very much (integer)
  4. Folk music: Don’t enjoy at all 1-2-3-4-5 Enjoy very much (integer)
  5. Country: Don’t enjoy at all 1-2-3-4-5 Enjoy very much (integer)
  6. Classical: Don’t enjoy at all 1-2-3-4-5 Enjoy very much (integer)
  7. Musicals: Don’t enjoy at all 1-2-3-4-5 Enjoy very much (integer)
  8. Pop: Don’t enjoy at all 1-2-3-4-5 Enjoy very much (integer)
  9. Rock: Don’t enjoy at all 1-2-3-4-5 Enjoy very much (integer)
  10. Metal, Hard rock: Don’t enjoy at all 1-2-3-4-5 Enjoy very much (integer)

Since there are so many fields and many variables, I took a specific field to analyze and see how can I represent all the variables into only one visualization in order to gain some insights from that specific field.

The chosen field is: MUSIC

2 Coding process

2.1 Import data

data = read.csv("responses.csv")
columns = read.csv("columns.csv")

head(data)

See the columns data

head(columns)

2.2 EDA & Preprocessing

Firstly, we will take a look at the type of our data

str(data)
## 'data.frame':    1010 obs. of  150 variables:
##  $ Music                         : int  5 4 5 5 5 5 5 5 5 5 ...
##  $ Slow.songs.or.fast.songs      : int  3 4 5 3 3 3 5 3 3 3 ...
##  $ Dance                         : int  2 2 2 2 4 2 5 3 3 2 ...
##  $ Folk                          : int  1 1 2 1 3 3 3 2 1 5 ...
##  $ Country                       : int  2 1 3 1 2 2 1 1 1 2 ...
##  $ Classical.music               : int  2 1 4 1 4 3 2 2 2 2 ...
##  $ Musical                       : int  1 2 5 1 3 3 2 2 4 5 ...
##  $ Pop                           : int  5 3 3 2 5 2 5 4 3 3 ...
##  $ Rock                          : int  5 5 5 2 3 5 3 5 5 5 ...
##  $ Metal.or.Hardrock             : int  1 4 3 1 1 5 1 1 5 2 ...
##  $ Punk                          : int  1 4 4 4 2 3 1 2 1 3 ...
##  $ Hiphop..Rap                   : int  1 1 1 2 5 4 3 3 1 2 ...
##  $ Reggae..Ska                   : int  1 3 4 2 3 3 1 2 2 4 ...
##  $ Swing..Jazz                   : int  1 1 3 1 2 4 1 2 2 4 ...
##  $ Rock.n.roll                   : int  3 4 5 2 1 4 2 3 2 4 ...
##  $ Alternative                   : int  1 4 5 5 2 5 3 1 NA 4 ...
##  $ Latino                        : int  1 2 5 1 4 3 3 2 1 5 ...
##  $ Techno..Trance                : int  1 1 1 2 2 1 5 3 1 1 ...
##  $ Opera                         : int  1 1 3 1 2 3 2 2 1 2 ...
##  $ Movies                        : int  5 5 5 5 5 5 4 5 5 5 ...
##  $ Horror                        : int  4 2 3 4 4 5 2 4 1 2 ...
##  $ Thriller                      : int  2 2 4 4 4 5 1 4 5 1 ...
##  $ Comedy                        : int  5 4 4 3 5 5 5 5 5 5 ...
##  $ Romantic                      : int  4 3 2 3 2 2 3 2 4 5 ...
##  $ Sci.fi                        : int  4 4 4 4 3 3 1 3 4 1 ...
##  $ War                           : int  1 1 2 3 3 3 3 3 5 3 ...
##  $ Fantasy.Fairy.tales           : int  5 3 5 1 4 4 5 4 4 4 ...
##  $ Animated                      : int  5 5 5 2 4 3 5 4 4 4 ...
##  $ Documentary                   : int  3 4 2 5 3 3 3 3 5 4 ...
##  $ Western                       : int  1 1 2 1 1 2 1 1 1 1 ...
##  $ Action                        : int  2 4 1 2 4 4 2 3 1 2 ...
##  $ History                       : int  1 1 1 4 3 5 3 5 3 3 ...
##  $ Psychology                    : int  5 3 2 4 2 3 3 2 2 2 ...
##  $ Politics                      : int  1 4 1 5 3 4 1 3 1 3 ...
##  $ Mathematics                   : int  3 5 5 4 2 2 1 1 1 3 ...
##  $ Physics                       : int  3 2 2 1 2 3 1 1 1 1 ...
##  $ Internet                      : int  5 4 4 3 2 4 2 5 1 5 ...
##  $ PC                            : int  3 4 2 1 2 4 1 4 1 1 ...
##  $ Economy.Management            : int  5 5 4 2 2 1 3 1 1 4 ...
##  $ Biology                       : int  3 1 1 3 3 4 5 2 3 2 ...
##  $ Chemistry                     : int  3 1 1 3 3 4 5 2 1 1 ...
##  $ Reading                       : int  3 4 5 5 5 3 3 2 5 4 ...
##  $ Geography                     : int  3 4 2 4 2 3 3 3 1 4 ...
##  $ Foreign.languages             : int  5 5 5 4 3 4 4 4 1 5 ...
##  $ Medicine                      : int  3 1 2 2 3 4 5 1 1 1 ...
##  $ Law                           : int  1 2 3 5 2 3 3 2 1 1 ...
##  $ Cars                          : int  1 2 1 1 3 5 4 1 1 1 ...
##  $ Art.exhibitions               : int  1 2 5 5 1 2 1 1 1 4 ...
##  $ Religion                      : int  1 1 5 4 4 2 1 2 2 4 ...
##  $ Countryside..outdoors         : int  5 1 5 1 4 5 4 2 4 4 ...
##  $ Dancing                       : int  3 1 5 1 1 1 3 1 1 5 ...
##  $ Musical.instruments           : int  3 1 5 1 3 5 2 1 2 3 ...
##  $ Writing                       : int  2 1 5 3 1 1 1 1 1 1 ...
##  $ Passive.sport                 : int  1 1 5 1 3 5 5 4 4 4 ...
##  $ Active.sport                  : int  5 1 2 1 1 4 3 5 1 4 ...
##  $ Gardening                     : int  5 1 1 1 4 2 3 1 1 1 ...
##  $ Celebrities                   : int  1 2 1 2 3 1 1 3 5 2 ...
##  $ Shopping                      : int  4 3 4 4 3 2 3 3 2 4 ...
##  $ Science.and.technology        : int  4 3 2 3 3 3 4 2 1 3 ...
##  $ Theatre                       : int  2 2 5 1 2 1 3 2 5 5 ...
##  $ Fun.with.friends              : int  5 4 5 2 4 3 5 4 4 5 ...
##  $ Adrenaline.sports             : int  4 2 5 1 2 3 1 2 1 2 ...
##  $ Pets                          : int  4 5 5 1 1 2 5 5 1 2 ...
##  $ Flying                        : int  1 1 1 2 1 3 1 3 2 4 ...
##  $ Storm                         : int  1 1 1 1 2 2 3 2 3 5 ...
##  $ Darkness                      : int  1 1 1 1 1 2 2 4 1 4 ...
##  $ Heights                       : int  1 2 1 3 1 2 1 3 5 5 ...
##  $ Spiders                       : int  1 1 1 5 1 1 1 1 5 3 ...
##  $ Snakes                        : int  5 1 1 5 1 2 5 5 5 4 ...
##  $ Rats                          : int  3 1 1 5 2 2 1 3 2 4 ...
##  $ Ageing                        : int  1 3 1 4 2 1 4 1 2 3 ...
##  $ Dangerous.dogs                : int  3 1 1 5 4 1 1 2 3 5 ...
##  $ Fear.of.public.speaking       : int  2 4 2 5 3 3 1 4 4 3 ...
##  $ Smoking                       : chr  "never smoked" "never smoked" "tried smoking" "former smoker" ...
##  $ Alcohol                       : chr  "drink a lot" "drink a lot" "drink a lot" "drink a lot" ...
##  $ Healthy.eating                : int  4 3 3 3 4 2 4 2 1 3 ...
##  $ Daily.events                  : int  2 3 1 4 3 2 3 3 1 4 ...
##  $ Prioritising.workload         : int  2 2 2 4 1 2 5 1 2 2 ...
##  $ Writing.notes                 : int  5 4 5 4 2 3 5 3 1 2 ...
##  $ Workaholism                   : int  4 5 3 5 3 3 5 2 4 3 ...
##  $ Thinking.ahead                : int  2 4 5 3 5 3 3 4 2 3 ...
##  $ Final.judgement               : int  5 1 3 1 5 1 3 3 5 5 ...
##  $ Reliability                   : int  4 4 4 3 5 3 4 3 5 4 ...
##  $ Keeping.promises              : int  4 4 5 4 4 4 5 3 4 5 ...
##  $ Loss.of.interest              : int  1 3 1 5 2 3 3 1 1 3 ...
##  $ Friends.versus.money          : int  3 4 5 2 3 2 4 4 4 4 ...
##  $ Funniness                     : int  5 3 2 1 3 3 4 4 2 3 ...
##  $ Fake                          : int  1 2 4 1 2 1 1 2 2 1 ...
##  $ Criminal.damage               : int  1 1 1 5 1 4 2 1 1 2 ...
##  $ Decision.making               : int  3 2 3 5 3 2 2 3 4 5 ...
##  $ Elections                     : int  4 5 5 5 5 5 5 5 1 5 ...
##  $ Self.criticism                : int  1 4 4 5 5 4 3 3 3 4 ...
##  $ Judgment.calls                : int  3 4 4 4 5 4 5 5 2 5 ...
##  $ Hypochondria                  : int  1 1 1 3 1 1 1 2 2 1 ...
##  $ Empathy                       : int  3 2 5 3 3 4 4 1 5 4 ...
##  $ Eating.to.survive             : int  1 1 5 1 1 2 1 2 1 1 ...
##  $ Giving                        : int  4 2 5 1 3 3 5 3 1 4 ...
##  $ Compassion.to.animals         : int  5 4 4 2 3 5 5 5 4 5 ...
##  $ Borrowed.stuff                : int  4 3 2 5 4 5 5 2 5 4 ...
##   [list output truncated]

Secondly, we will split each category in dataframes for a better management

# Split into DFs
music_dataframe = data.frame(data[1:19])

movies_dataframe = data.frame(data[20:31])

hobbies_dataframe = data.frame(data[32:63])

phobias_dataframe = data.frame(data[64:73])

health_habits_dataframe = data.frame(data[74:76])

personality_dataframe = data.frame(data[78:133])

spendings_dataframe = data.frame(data[134:140])

demographics_dataframe = data.frame(data[141:150])

Now, we will take a look how much people enjoy the Music in general

Firstly create a dataframe of the frequencies for the specific question

overall_music = music_dataframe %>% 
  group_by(Music) %>%
  summarise(counts = n())

overall_music
overall_music_plot = ggplot(overall_music, aes(x = Music, y = counts, fill=counts)) +
  geom_bar(stat = "identity", fill="steelblue")+
  geom_text(aes(label= counts), vjust = -0.3)+
  theme(axis.title.x = element_blank(),
        plot.title = element_text(hjust = 0.5))+
  labs(y="Number of respondents")+
  ggtitle("How much people enjoy Music?")+
  geom_text(x=1, y=770, label="1 - Strongly disagree", color="red", hjust = 0, size = 5)+
  geom_text(x=1, y=730, label="5 - Strongly agree", color = "blue", hjust = 0, size = 5)+
  scale_y_continuous(breaks=round(seq(0, max(overall_music$counts), by=100), 1))

overall_music_plot

2.3 Convert data in another format

Now comes the hard part

We want to convert the music_dataframe into the following table example:

Question Strongly_disagree Disagree Neutral Agree Strongly agree
Music 19 20 31 42 50
Dance 11 22 33 44 55

This table represents the following:

  • The columns are: the question and the response (1-5) converted into Strongly disagree-Strongly agree
  • Each row represents a question and the count of each response

For this, we will take the columns and retain it into a variable

## Convert columns to Rows
Question = colnames(music_dataframe)
Question
##  [1] "Music"                    "Slow.songs.or.fast.songs"
##  [3] "Dance"                    "Folk"                    
##  [5] "Country"                  "Classical.music"         
##  [7] "Musical"                  "Pop"                     
##  [9] "Rock"                     "Metal.or.Hardrock"       
## [11] "Punk"                     "Hiphop..Rap"             
## [13] "Reggae..Ska"              "Swing..Jazz"             
## [15] "Rock.n.roll"              "Alternative"             
## [17] "Latino"                   "Techno..Trance"          
## [19] "Opera"

Then we create a dataframe with the column “Question” and add as values each Category from the Field of Music After that, we will add as columns the available responses with NA values (since this needs to be preprocessed first)

df = data.frame(Question)

df = df %>%
  add_column("Strongly_disagree" = NA,
             "Disagree" = NA,
             "Neutral" = NA,
             "Agree" = NA,
             "Strongly_agree" = NA)

df

Now we will count and group each occurrence for each response (1-5) and store it in occurrences table This table will be unzipped then because we are interested only in the values, not in the columns

# Count occurrences of each value in each column
occurrences <- lapply(music_dataframe, table)
values <- lapply(occurrences, function(x) as.vector(unname(x)))
print(occurrences)
## $Music
## 
##   1   2   3   4   5 
##   9  11  36 129 822 
## 
## $Slow.songs.or.fast.songs
## 
##   1   2   3   4   5 
##  23  66 580 235 104 
## 
## $Dance
## 
##   1   2   3   4   5 
##  94 215 321 235 141 
## 
## $Folk
## 
##   1   2   3   4   5 
## 290 336 235  87  57 
## 
## $Country
## 
##   1   2   3   4   5 
## 343 346 198  85  33 
## 
## $Classical.music
## 
##   1   2   3   4   5 
## 138 249 281 189 146 
## 
## $Musical
## 
##   1   2   3   4   5 
## 195 248 283 166 116 
## 
## $Pop
## 
##   1   2   3   4   5 
##  60 155 260 314 218 
## 
## $Rock
## 
##   1   2   3   4   5 
##  55 105 207 294 343 
## 
## $Metal.or.Hardrock
## 
##   1   2   3   4   5 
## 384 219 164 136 104 
## 
## $Punk
## 
##   1   2   3   4   5 
## 319 227 220 152  84 
## 
## $Hiphop..Rap
## 
##   1   2   3   4   5 
## 212 206 206 224 158 
## 
## $Reggae..Ska
## 
##   1   2   3   4   5 
## 179 252 285 195  92 
## 
## $Swing..Jazz
## 
##   1   2   3   4   5 
## 204 231 271 198 100 
## 
## $Rock.n.roll
## 
##   1   2   3   4   5 
## 116 191 298 231 167 
## 
## $Alternative
## 
##   1   2   3   4   5 
## 213 220 247 172 151 
## 
## $Latino
## 
##   1   2   3   4   5 
## 191 250 238 172 151 
## 
## $Techno..Trance
## 
##   1   2   3   4   5 
## 380 207 194 140  82 
## 
## $Opera
## 
##   1   2   3   4   5 
## 389 294 175  98  53
values
## $Music
## [1]   9  11  36 129 822
## 
## $Slow.songs.or.fast.songs
## [1]  23  66 580 235 104
## 
## $Dance
## [1]  94 215 321 235 141
## 
## $Folk
## [1] 290 336 235  87  57
## 
## $Country
## [1] 343 346 198  85  33
## 
## $Classical.music
## [1] 138 249 281 189 146
## 
## $Musical
## [1] 195 248 283 166 116
## 
## $Pop
## [1]  60 155 260 314 218
## 
## $Rock
## [1]  55 105 207 294 343
## 
## $Metal.or.Hardrock
## [1] 384 219 164 136 104
## 
## $Punk
## [1] 319 227 220 152  84
## 
## $Hiphop..Rap
## [1] 212 206 206 224 158
## 
## $Reggae..Ska
## [1] 179 252 285 195  92
## 
## $Swing..Jazz
## [1] 204 231 271 198 100
## 
## $Rock.n.roll
## [1] 116 191 298 231 167
## 
## $Alternative
## [1] 213 220 247 172 151
## 
## $Latino
## [1] 191 250 238 172 151
## 
## $Techno..Trance
## [1] 380 207 194 140  82
## 
## $Opera
## [1] 389 294 175  98  53

Then we manually add these values for each ROW starting from the second column until the end P.S.: I added manually because the ‘for’ loop didn’t want to help me

### INSERT VALUES FOR THE COLUMNS OF EACH QUESTION
df[1, 2:(1 + length(values$Music))] <- c(values$Music, df[2, -(1:(1 + length(values$Music)))])
df[2, 2:(1 + length(values$Slow.songs.or.fast.songs))] <- c(values$Slow.songs.or.fast.songs, df[2, -(1:(1 + length(values$Slow.songs.or.fast.songs)))])
df[3, 2:(1 + length(values$Dance))] <- c(values$Dance, df[2, -(1:(1 + length(values$Dance)))])
df[4, 2:(1 + length(values$Folk))] <- c(values$Folk, df[2, -(1:(1 + length(values$Folk)))])
df[5, 2:(1 + length(values$Country))] <- c(values$Country, df[2, -(1:(1 + length(values$Country)))])
df[6, 2:(1 + length(values$Classical.music))] <- c(values$Classical.music, df[2, -(1:(1 + length(values$Classical.music)))])
df[7, 2:(1 + length(values$Musical))] <- c(values$Musical, df[2, -(1:(1 + length(values$Musical)))])
df[8, 2:(1 + length(values$Pop))] <- c(values$Pop, df[2, -(1:(1 + length(values$Pop)))])
df[9, 2:(1 + length(values$Rock))] <- c(values$Rock, df[2, -(1:(1 + length(values$Rock)))])
df[10, 2:(1 + length(values$Metal.or.Hardrock))] <- c(values$Metal.or.Hardrock, df[2, -(1:(1 + length(values$Metal.or.Hardrock)))])
df[11, 2:(1 + length(values$Punk))] <- c(values$Punk, df[2, -(1:(1 + length(values$Punk)))])
df[12, 2:(1 + length(values$Hiphop..Rap))] <- c(values$Hiphop..Rap, df[2, -(1:(1 + length(values$Hiphop..Rap)))])
df[13, 2:(1 + length(values$Reggae..Ska))] <- c(values$Reggae..Ska, df[2, -(1:(1 + length(values$Reggae..Ska)))])
df[14, 2:(1 + length(values$Swing..Jazz))] <- c(values$Swing..Jazz, df[2, -(1:(1 + length(values$Swing..Jazz)))])
df[15, 2:(1 + length(values$Rock.n.roll))] <- c(values$Rock.n.roll, df[2, -(1:(1 + length(values$Rock.n.roll)))])
df[16, 2:(1 + length(values$Alternative))] <- c(values$Alternative, df[2, -(1:(1 + length(values$Alternative)))])
df[17, 2:(1 + length(values$Latino))] <- c(values$Latino, df[2, -(1:(1 + length(values$Latino)))])
df[18, 2:(1 + length(values$Techno..Trance))] <- c(values$Techno..Trance, df[2, -(1:(1 + length(values$Techno..Trance)))])
df[19, 2:(1 + length(values$Opera))] <- c(values$Opera, df[2, -(1:(1 + length(values$Opera)))])

df

2.4 Plot creation

For the plot creation we will choose a stacked bar chart, since we have discrete values (number of occurrences) combined with categorical values (Strongly disagree-Strongly agree) and nominal values (Music, Dance, Folk, Opera, etc.)

For this we will transform the dataframe into a long format using tidyr

df_long <- tidyr::gather(df, Answer, Occurrences, -Question)
df_long

Now we want also to calculate the percentages of each response from each Category

# Calculate percentages
df_long <- transform(df_long, Percentage = Occurrences / tapply(Occurrences, Question, sum)[Question] * 100)

# Reorder the levels for Answer column
df_long$Answer <- factor(df_long$Answer, levels = c("Strongly_agree", "Agree", "Neutral", "Disagree", "Strongly_disagree"))

df_long

And now we will create the wanted stacked bar chart

# Horizontal Stacked Bar Chart
stacked_music_field <- ggplot(df_long, aes(x = Percentage, y = Question, fill = Answer, text = paste("Answer: ", Answer, "<br>Occurrences: ", Occurrences, "<br>Percentage: ", Percentage, "%"))) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = ifelse(Percentage < 3, "", paste0(round(Percentage, 1), "%"))),
            position = position_stack(vjust = 0.5), color = "blue", size = 3.5) +
  geom_text(aes(label = Occurrences), size = 0, alpha = 0) +  # Add hidden Occurrences
  labs(x = "Percentage", y = "Category") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("How much people enjoy the following Categories?")+
  scale_fill_manual(name = "Scale", values = c("seagreen4", "green", "grey", "chocolate1", "tomato2"), labels = c("Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"))

# Convert ggplot to a plotly plot
stacked_music_field <- ggplotly(stacked_music_field, tooltip = "text")

# Modify the tooltip text
stacked_music_field$x$data[[1]]$text <- df_long$Answer

# Print the plot
stacked_music_field

Note: As we seen from first section, the Question: “Do you prefer Slow songs or Fast songs?” is valued as the following:

  • 1 - Slow songs
  • 5 - Fast songs

3 Conclusions

This project helped me to draw some important conclusion during the coding process:

  • The transition from wide to long format is quite hard
  • From 0 ideas of how to represent so many variables, I came to over 5 ideas and most of them were way too complex
  • Stacked bar chart represents a simple & very efficient way to visualize multiple variables with many levels of responses
  • People really dislike Metal/Hardrock type, but they Agree to Rock type… weird
  • There are other many ideas different from this graph such that the data was represented in a possible better way, but the time will increase exponetially
  • There is still place for improvement even for this graph (Personally I know at least 2 things that I can improve)
  • My expected time spent on this project was 5-7 hours, but it ended with 20+ hours…